Feature Selection by Higher Criticism Thresholding: 
Optimal Phase Diagram 

David Donoho 1 and Jiashun Jin 2 
department of Statistics, Stanford University 
2 Department of Statistics, Carnegie Mellon University 

December 11, 2008 

Abstract 

We consider two-class linear classification in a high-dimensional, low-sample size setting. 
Only a small fraction of the features are useful, the useful features are unknown to us, and 
each useful feature contributes weakly to the classification decision - this setting was called 
the rare/weak model (RW Model) in [TT] . 

We select features by thresholding feature z-scores. The threshold is set by higher- 
criticism (HC) [ITj . Let Hi denote the P- value associated to the i-th z-score and 7T(i) denote 
the i-th order statistic of the collection of P-values. The HC threshold (HCT) is the order 
statistic of the z-score corresponding to index i maximizing (i/n — 7i~(i))/ ^Tr^(l — n^). 
The ideal threshold optimizes the classification error. In [11] we showed that HCT was 
numerically close to the ideal threshold. 

We formalize an asymptotic framework for studying the RW model, considering a se- 
quence of problems with increasingly many features and relatively fewer observations. We 
show that along this sequence, the limiting performance of ideal HCT is essentially just as 
good as the limiting performance of ideal thresholding. Our results describe two-dimensional 
phase space, a two-dimensional diagram with coordinates quantifying "rare" and "weak" in 
the RW model. Phase space can be partitioned into two regions - one where ideal threshold 
classification is successful, and one where the features are so weak and so rare that it must 
fail. Surprisingly, the regions where ideal HCT succeeds and fails make the exact same 
partition of the phase diagram. Other threshold methods, such as FDR threshold selection, 
are successful in a substantially smaller region of the phase space than either HCT or Ideal 
thresholding. 

The False Feature Detection Rate (FDR) and local FDR of the ideal and HC threshold 
selectors have surprising phase diagrams which we also describe. Results showing asymptotic 
equivalence of HCT with ideal HCT can be found in a companion paper [12] . 
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1 Introduction 

The modern era of high-throughput data collection creates data in abundance. Some devices - 
spectrometers and gene chips come to mind - automatically generate measurements on thousands 
of standard features of each observational unit. 



What hasn't changed in science is the difficulty of obtaining good observational units. High- 
throughput devices don't help us to find and enroll qualified patients, or catch exotic butterflies, 
or observe primate mating behaviors. Hence, in many fields the number of observational units 
- eg. patients, butterflies, or matings - has not increased, and stays today in the dozens or 
hundreds. But each of those few observational units can now be subjected to a large battery of 
automatic feature measurements 

Many of those automatically measured features will have little relevance to any given project. 
This new era of feature glut poses a needle-in-a-haystack problem: we must detect a relatively 
few valuable features among many useless ones. Unfortunately, the combination of small sample 
sizes (few observational units) and high dimensions (many feature measurements) makes it hard 
to tell needles from straw. 

Orthodox statistical methods assumed a quite different set of conditions: more observations 
than features, and all features highly relevant. Modern statistical research is intensively develop- 
ing new tools and theory to address the new unorthodox setting; such research comprised much 
of the activity in the recent 6-month Newton Institute program Statistical Theory and Methods 
for Complex, High- Dimensional Data. 

In this article we focus on this new setting, this time addressing the challenges that modern 
high-dimensional data pose to linear classification schemes. New data analysis tools and new 
types of mathematical analysis of those tools will be introduced. 

1.1 Multivariate normal classification 

Consider a simple model of linear classifier training. We have a set of labelled training samples 
(Yi,Xi), i — 1, . . . , n, where each label is ±1 and each feature vector £ RP . For simplicity, 
we assume the training set contains equal numbers of l's and — l's and that the feature vectors 
Xi £ RP obey Xi ~ N(Yip,, £), i = 1, . . . , n, for an unknown mean contrast vector \i S RP; here 
S denotes the feature covariance matrix and n is the training set size. In this simple setting, one 
ordinarily uses linear classifiers to classify an unlabeled test vector X, taking the general form 
L(X) = Ysj=i w (j)X(j), for a sequence of 'feature weights' w = (w(j) : j = 1, . . . ,p). 

Classical theory going back to RA Fisher [5] shows that the optimal classifier has feature 
weights w oc £~V; at first glance linear classifier design seems straightforward and settled. 
However, in many of today's most active application areas, it is a major challenge to construct 
linear classifiers which work well. 

1.2 p larger than n 

The culprit can be called the ll p > n problem" . A large number p of measurements is automati- 
cally made on thousands of standard features, but in a given project, the number of observational 
units, n, might be in the dozens or hundreds. The fact that p> n makes it difficult or impossible 
to estimate the feature covariance £ with any precision. 

It is well known that naive application of the formula w oc to empirical data in the 

p > n setting is problematic; at a minimum, because the matrix of empirical feature covariances 
Cov n p (X) is not invertible. But even if we use the generalized inverse Cov n ^ p (X)\ the resulting 
naive classification weights, w na ive oc C'ov n . p (X)^ Cov n ^ p (Y, X), often give very "noisy" classifiers 
with low accuracy. The modern feature glut thus seriously damages the applicability of 'textbook' 
approaches. 

A by-now standard response to this problem is to simply ignore feature covariances, and 
standardize the features to mean zero and variance one. One in effect pretends that the feature 
covariance matrix £ is the identity matrix, and uses the formula w(j) oc C'ov(Y, X(j)) [51 116]. 
Even after this reduction, further challenges remain. 

1.3 When features are rare and weak 

In fields such as genomics, with a glut of feature measurements per observational unit, it is 
expected that few measured features will be useful in any given project; nevertheless, they 
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all get measured, because researchers can't say in advance which ones will be useful. Moreover, 
reported misclassification rates are relatively high. Hence there may be numerous useful features, 
but they arc relatively rare and individually quite weak. 

Such thinking motivated the following rare/weak feature model (RW Feature Model) in 
Under this model, 

• Useful features are rare: the contrast vector /i is nonzero in only k out of p elements, 
where e = k/p is small, i.e. close to zero. As an example, think of p — 10, 000, k = 100, so 
e = k/p = .01. In addition, 

• Useful features are weak: the nonzero elements of \i have common amplitude zxo, which 
is assumed not to be 'large'. 'Large' can be measured using t = y/nno; values of r in the 
range 2 to 4 imply that corresponding values of /i are not large. 

Since the elements X(j) of the feature vector where the class contrast — are entirely 
uninformative about the value of Y(j), only the k features where = ix are useful. The 
problem is how to identify and benefit from those rare, weak features. 

We speak of e and r as the sparsity and strength parameters for the Rare/ Weak model, and 
refer to the RW{e, r) model. Models with a 'sparsity' parameter e are common in estimation 
settings [Hl[T3j|27], but not with the feature strength constraint r. Also closely related to the 
RW model is work in multiple testing by Ingster and the authors [53J E0 US] . 

1.4 Feature selection by thresholding 

Feature selection - i.e. working only with an empirically-selected subset of features - is a standard 
response to feature glut. We are supposing, as announced in Section fO] that feature correlations 
can be ignored and that features are already standardized to variance one. We therefore focus 
on the vector of feature Z-scores, with components Z(j) — n~ x / 2 YiXi(j), j = l,,..,p. 
These are the Z-scores of two-sided tests of H j: Cov(Y,X(j)) = 0. Under our assumptions 
Z ~ N(9,I p ) with 9 = \fnp, and fj, the feature contrast vector. Features with nonzero 
typically have significantly nonzero Z(J) while all other features will have Z(j) values largely 
consistent with the null hypothesis — 0. In such a setting, selecting features with Z-scores 
above a threshold makes sense. 

We identify three useful threshold functions: ij$(z), * e {clip, hard, soft}. These are: Clip- 
ping - i]f zp (z) = sgn(z) • l{M>rt, which ignores the size of the Z-score, provided it is large; Hard 
Thresholding - i]^ ard (z) — z ■ lri z i >t i, which uses the size of the Z-score, provided it is large; 
and Soft Thresholding - ^"^(z) = sga(z)(\z\ — t) + , which uses a shrunken Z-score, provided it 
is large. 

Definition 1.1 Let -k g {soft, hard, clip}. The threshold feature selection classifier makes its 
decision based on L\ <> where L t {X) = Yjj=i ™t (j)X(j), and u>t(j) = i]^(Z(j)),j — 1, . . . ,p. 

In words, the classifier sums across features with large training-set Z-scores, and a simple function 
of the Z-score generates the feature weight. 

Several methods for linear classification in bioinformatics follow this approach: the Shrunken 
Centroids method [29] is a variant of soft thresholding in this two-class setting; the highly-cited 
methods in [T7] and [25] are variants of hard thresholding. Clipping makes sense in the theoretical 
setting of the RW model (since then the useful features have all the same strength) and is simpler 
to analyse than the other nonlinearitics. 

Thresholding has been popular in estimation for more than a decade [14] : it is known to be 
succesful in 'sparse' settings where the estimand has many coordinates, of which only a relatively 
few coordinates are significantly nonzero. However classification is not the same as estimation, 
and performance characteristics are driven by quite different considerations. 

One crucial question remains: how to choose the threshold based on the data? Popular 
methods for threshold choice include cross-validation [55] : control of the false discovery rate 
[U [21 21 [10] ; and control of the local false discovery rate [15] . 
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1.5 Higher Criticism 



In [TT] we proposed a method of threshold choice based on recent work in the field of multiple 
comparisons. We now very briefly mention work in that field and then introduce the threshold 
choice method. 



1.5.1 HC testing 

Suppose we have a collection of N P-valucs 7Tj which under the global null hypothesis are 
uniformly distributed: 7Tj ~ iid J7[0, 1] . Consider the order statistics: ir^ < 7T( 2 ) < ••■ < tt/n)- 
Under the null hypothesis, these order statistics have the usual properties of uniform order 
statistics, including the asymptotic normality tt^ ^ a pprox Normal (i/N, i/N(l — i/N)). HC 
forms a Z-score comparing iru-s with its mean under the null, and then maximizes over a wide 
range of i. Formally: 

Definition 1.2 (HC Testing,) [f$. The Higher Criticism objective is 

HC(i; nu)) = VN l / N ~ 7 l(!} — , n !) 

W y/i/NO- - W) 

Fix cto € (0, 1) (eg ao = 1/10). The HC test statistic is HC* = maxi<i< Qo jv HC(i; ^(i))- 

HC seems insensitive to the selection of a, in Rare/Weak situations; here we always use 
a = .10. 

In words, we look for the largest standardized discrepancy for any tt^ between the observed 
behavior and the expected behavior under the null. When this is large, the whole collection of 
P- values is not consistent with the global null hypothesis. The phrase "Higher Criticism" is due 
to John Tukey, and reflects the shift in emphasis from single test results to the whole collection 
of tests; see discussion in [9]. Note: there are several variants of HC statistic; we discuss only 
one variant in this brief note; the main results of [5] still apply to this variant; for full discussion 
see 0[TT]. 



1.5.2 HC thresholding 

Return to the classification setting in previous sections. We have a vector of feature Z-scores 
(Z(j),j = 1, ...,p). To apply HC notions, translate Z-scores into two-sided P-values, and 
maximizes the HC objective over index i in the appropriate range. Define the feature P-values 
Hi = Prob{\N(0, 1)| > |Z(z)|}, i — 1, . . . ,p; and define the increasing rearrangement tt^, the HC 
objective function HC(i\ir^), and the increasing rearrangement \Z\^ correspondingly. Here is 
our proposal. 

Definition 1.3 (HC Thresholding/ Apply the HC procedure to the feature P-values. Let the 
maximum HC objective be achieved at index i. The Higher Criticism threshold (HCT) is 
the value t HC — \Z\^y The HC threshold feature selector selects features with Z -scores 

exceeding t HC in magnitude. 

Figure [l] illustrates the procedure. Panel (a) shows a sample of Z-scores, Panel (b) shows a 
PP-plot of the corresponding ordered P-values versus i/p and Panel (c) shows a standardized 
PP-plot. The standardized PP-Plot has its largest deviation from zero at i; and this generates 
the threshold value. 



1.5.3 Previously-reported results for HCT 

Our article [11] reported several findings about behavior of HCT based on numerical and em- 
pirical evidence. In the RW model, we can define an ideal threshold, i.e. a threshold based on 
full knowledge of the RW parameters e and p, and chosen to minimize the misclassification rate 
of the threshold classifier - see Section [2] below. We showed in [TT] that: 

• HCT gives a threshold value which is numerically very close to the ideal threshold. 



4 



ordered z-scores 



10 
5 




0.02 0.04 0.06 0.08 0.1 
ordered p-values 




0.02 0.04 0.06 0.08 0.1 
HC objective function 



0.02 0.04 0.06 0.08 0.1 



Figure 1: Illustration of HC thresholding. Panel (a) the ordered |Z|-scores. Panel (b) the 
corresponding ordered P-values in a PP plot. Panel (c) the HC objective function in (1.1 1; this 
is largest at i « 0.007p (a;-axes are i/p). Vertical lines indicate ir^ in Panel (b), and in 
Panel (a). 



• In the case of very weak feature z-scores, HCT has a False Feature Discovery Rate (FDR) 
substantially higher than other popular approaches, but a Feature Missed Detection Rate 
(MDR) substantially lower than those other approaches; 

• At the same time, HCT has FDR and MDR very closely matching those of the ideal 
threshold. 

In short, HCT has very different operating characteristics from those other thresholding schemes 
like FDR thresholding [TJ [2| and Bonferroni thresholding, but very similar operating character- 
istics to the ideal threshold. 

1.6 Asymptotic RW model, and the phase diagram 

In this paper we further support the findings reported in |llj . this time using an asymptotic 
analysis. In our analysis the number of observations n and the number of features p tend to 
infinity in a linked fashion, with n remaining very small compared to p. (Empirical results in 
|llj show our large-p theory is applicable at moderate n and p) . 

More precisely, we consider a sequence of problems with increasingly more features, increas- 
ingly more rare useful features, and relatively small numbers of observations compared to the 
number of features. 

Definition 1.4 The phrase asymptotic RW model refers to the following combined assump- 
tions. 

• Asymptotic Setting. We consider a sequence of problems, where the number of obser- 
vations n and the number of features p both tend to oo along the sequence. 

• p dramatically larger than n. Along this sequence, n ~ c-log(p) 7 , so there are dramat- 
ically more features per observational unit than there are observational units. 

• Increasing Rarity. The sparsity e varies with n and p according to e = p~@ , < {3 < 1. 

• Decreasing Strength. The strength t varies with n and p according to r = \/2r log(p), 
< r < 1. 
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The symbol ARW(r, [3; c, 7) refers to the model combining these assumptions. 



In this model, because r < 1, useful features are individually too weak to detect and because 
< j3 < 1, useful features are increasingly rare with increasing p, while increasing in total number 
with p. It turns out that c and 7 are incidental, while r and (3 are the driving parameters. Hence 
we always simply write ARW(r, (3) below. 

There is a large family of choices of (r, (3) where successful classification is possible, and 
another large family of choices where it is impossible. To understand this fact, we use the 
concept of phase space, the two-dimensional domain < r, j3 < 1. We show that this domain 
is partitioned into two regions or 'phases'. In the "impossible" phase, useful features are so 
rare and so weak that classification is asymptotically impossible even with the ideal choice of 
threshold. In the "possible" phase, successfully separating the two groups is indeed possible - if 
one has access to the ideal threshold. Figure [2] displays this domain and its partition into phases. 
Because of the partition into two phases, we also call this display the phase diagram. An explicit 



formula for the graph r = p*(f3) bounding these phases is given in (3.2 1 below. 

The phase diagram provides a convenient platform for comparing different procedures. A 
threshold choice is optimal if it gives the same partition of phase space as the one obtained 
with the ideal choice of threshold. 

How does HCT compare to the ideal threshold, and what partition in the phase space does 
HCT yield? For reasons of space, we focus in this paper on the Ideal HC threshold, which is 
obtained upon replacing the empirical distribution of feature Z-scores by it expected value. The 
Ideal HC threshold is thus the threshold which HCT is 'trying to estimate'; in the companion 
paper [12 we give a full analysis showing that the ideal HCT and HCT are close. 

The central surprise of our story is that HC behaves surprisingly well: the partition of phase 
space describing the two regions where ideal thresholding fails and/or succeeds also describes 
the two regions where Ideal HCT fails and/or succeeds in classifying accurately. The situation 
is depicted in the table below: Here by 'succeeds', we mean asymptotically zero misclassification 



Region 


Property of Ideal Threshold 


Property of Ideal HCT 


r<P*((3) 
r>P*((3) 


Ideal Threshold Classifier Fails 
Ideal Threshold Classifier Succeeds 


Ideal HCT Fails 
Ideal HCT Succeeds 



rate and by 'fails', we mean asymptotically 50% misclassification rate. 

In this sense of size of regions of success, HCT is just as good as the ideal threshold. Such 
statements cannot be made for some other popular thresholding schemes, such as False Discovery 
threshold selection. As will be shown in [T^] even the very popular Cross- Validated choice of 
Threshold will fail if the training set size is bounded, while HCT will still succeed in the RW 
model in that case. 

The full proof of a broader set of claims - with a considerably more general treatment - will 
appear elsewhere. To elaborate the whole story on HCT needs three connected papers including 
|12j . and the current one. In we reported numerical results both with simulated data 
from the RW model and with certain real data often used as standard benchmarks for classifier 
performance. In |12j . we will develop a more mathematical treatment of many results we cite 
here and in The current article, logically second in the triology, develops an analysis of 

Ideal HCT which is both transparent and which provides the key insights underlying our lengthy 
arguments in |12j . We also take the time to explain the notions of phase diagram and phase 
regions. We believe this paper will be helpful to readers who want to understand HCT and its 
performance, but who would be overwhelmed by the epsilontics of the analysis in |12j . 

The paper is organized as follows. Section [2] introduces a functional framework and several 
ideal quantities. These include the proxy classification error where Fisher's separation (SEP) [3 
plays a key role, the ideal threshold as a proxy for the optimal threshold, and the ideal HCT 
as a proxy of the HCT. Section [3] introduces the main results on the asymptotic behavior of 
the HC threshold under the asymptotic RW model, and the focal point is the phase diagram. 
Section [4] outlines the basic idea behind the main results followed by the proofs. Section [5] 
discuss the connection between the ideal threshold and the ideal HCT. Section [6] discusses the 
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ideal behavior of Bonferroni threshold feature selection and FDR-controlling feature selection. 
Section [7] discusses the link between ideal HCT and ordinary HCT, the finite p phase diagram, 
and other appearances of HC in recent literature. 



2 Sep functional and ideal threshold 

Suppose L is a fixed, nonrandom linear classifier, with decision boundary L <> 0. Will L 



correctly classify the future realization (Y,X) from simple model of Section 1.3 Then Y = ±1 



equiprobable and X ~ N(Y fj,, I p ). The misclassification probability can be written 

P{YL(X) < 0| M } = $(-isep(L; (2.1) 

where $ denotes the standard normal distribution function and Sep(L; fi) measures the stan- 
dardized interclass distance: 

q (T x E{L(X)\Y = 1} - E{L(X)\Y = -1} 

Sep(L^) = SD(L(X)) (2 ' 2) 

2E^(j>(i) = 2(w,fj.) 
(Ew(j)2) i/2 | H | 2 > 

The ideal linear classifier with feature weights w oc fi, and decision threshold L M <> 
implements the likelihood ratio test. It also maximizes Sep, since for every other linear classifier 
L, Sep(L;y) < Sep{L^;n) = 2||/z|| 2 . 

2.1 Certainty-equivalent heuristic 

Threshold selection rules give random linear classifiers: the classifier weight vector w is a random 
variable, because it depends on the ^-scores of the realized training sample. If Lz denotes a 
linear classifier constructed based on such a realized vector of Z-scores, then the misclassification 
error can be written as 

Err{L z ,n) = *{~Sep{L z ;n)); (2.3) 

this is a random variable depending on Z and on /i. Heuristically, because there is a large number 
of coordinates, some statistical regularity appears, and we anticipate that random quantities can 
be replaced by expected values. We proceed as if 

Sep(L z ; fi) w — -! — —72 ( 2 - 4 ) 

where the expectation is over the conditional distribution of Z conditioned on fi. Our next step 
derives from the fact that /i itself is random, having about e • p nonzero coordinates, in random 
locations. Now as it), = T)t(Zi) we write heuristically 

Ez\n(w,lJ,) « p ■ e ■ fi Q ■ E{q t {.Zi)\m = /x }, 

while 

£z|/>,™) " P • (e • = Mo} + (!-£)• = 0}) . 

Definition 2.1 Lei £/ie threshold t be fixed and chosen independently of the training set. In the 
RW(e, t) model we use the following expressions for proxy separation 

Sep{t;e,r) = 

v a 



where 

A(t,e,T) = e-T-E{ Vt (T + W)}. 
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B(t, e,r) = e ■ E{rg (r + W)} + (1 - e)E{if t (W)}. 



and W denotes a standard normal random variable. By proxy classification error we mean 



While ordinarily, we expect averages to "behave like expectations" in large samples, we use the 
word proxy to remind us that there is a difference (presumably small). Software to compute 
these proxy expressions has been developed by the authors, and some numerical results using 
them were reported in 

Of course, the rationale for our interest in these proxy expressions is our heuristic understand- 
ing that they accurately describe the exact large-sample behavior of certain threshold selection 
schemes. This issue is settled in the affirmative, after considerable effort, in [12 . 

2.2 Certainty-equivalent threshold functionals 

In general the best threshold to use in a given instance of the RW model depends on both e and 
r. It also depends on the specific realization of /z and even of Z. However, dependence on /i and 
Z is simply "noise" that goes away in large samples, while the dependence on e and r remains. 

Definition 2.2 The ideal threshold junctional Tid ea i{^, t) maximizes the proxy separation 



Hcuristically, Tideai represents a near-optimal threshold in all sufficiently large samples; it is 
what we "ought" to be attempting to use. 

Definition 2.3 Folding. The following concepts and notations will be used in connection with 
distributions of absolute values of random variables. 

• The Half Normal distribution function ^(t) = P{|iV(0, 1)| < t}. 

• The noncentral Half-Normal distribution *5f T (t) = P{\N(t, 1)| < t}. 

• Given a distribution function F, the folded distribution is G(t) — F(t) — F(—t). The Half 
Normal is the folded version of the standard Normal, and the noncentral Half Normal is 
the folded version of a Normal with unit standard deviation and nonzero mean equal to the 
noncentrality parameter. 

• Let F e T denote the 2-point mixture 



We now define an HCT functional representing the target that HC thresholding aims for. 

Definition 2.4 Let F be a distribution function which is not the standard normal At such 
a distribution, we define the HCT functional by 




Normalizations are chosen here so that, in large samples 




T id eai(e,T) = arg max t Sep(t,e,r). 



F eiT (t) = (l-e)$(t)+c$(t-T)i 



G e r denotes the corresponding folded distribution: 



G €tT {t) = (l-e)f (i) + e$ T (t), 



Thc(F) = argmax t>ti 



\fW) ■ G(t) 
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here G is the folding of F, and to is a fixed parameter of the HCT method ( eg. to = 4> 1 (0. 1) ). 
The HC threshold in the RW(e, r) model may be written, in an abuse of notation, 

Thc(c,t) 

meaning Thc{Fc,t)- 

Let F n v d enote the usual empirical distribution of the feature Z-scores Zj. The HCT of 
Definition 



1.3 



can be written as — T HC (F n p ). Let F denote the expected value of F n p ; 
then Thc{F) will be called the ideal HC threshold. Heuristically, we expect the usual sampling 
fluctuations and that 

THc{F n , P ) ~ Thc(F) 

with a discrepancy decaying as p and n increase. This issue is carefully considered in the 
companion paper [IS], which shows that the empirical HC threshold in the ARW model indeed 
closely matches the ideal HC threshold. 

For comparison purposes, we considered two other threshold schemes. First, (ideal) False- 
Discovery Rate thresholding. For a threshold t, and parameters (p, e,r), the expected number 
of useful features selected is 

E(TP)(t;e,T,p)=p-e-G e<T (t); 

and the expected number of useless features selected is 

E(FP)(t;e,T,p)=p-{l-e)-^(t). 

Let TPR(t) = p~ 1 E(TP)(t) denote the expected rate of useful features above threshold and 
FPR(t) = p~ 1 E(FP)(t) denote the expected rate of uslcss features above threshold. In analogy 
with our earlier heuristic, we define the proxy False Discovery Rate (FDR) 

— , , FPR(t) 

FDR ^ T ^ = TPR(t) + FPR(t) 

(The term "proxy" reminds us that 

E(FP)(t) E _ FP(t) 



E{TP){t) + E(FP)(t) r TP(t) + FP(t) ' 

although for large p the difference will often be small.) 
We define the FDRT-a functional by 

Tfdr,o.{^,t) = min{i : FDR < a, t > to}. 

Heuristically, this is the threshold that FDRT is 'trying' to learn from noisy empirical data. We 
will also need the proxy Local FDR. 

rr-r , \ FPR'lt) 

Lfdr(t;e,r,p)^ TpR/{t) + FpR/{t) . 

Here FPR' denotes the derivative of FPR, which exists, using smoothness properties of ^o! 
similarly for TPR' and * r . Intuitively, Lfdr(t) denotes the expected fraction of useless features 
among those features having observed Z-scores near level t. 
Second, we considered Bonferroni-based thresholding. 

T B on - ^Cp" 1 ). 

This threshold level is set at the level that would cause on average one false alarm in a set of p 
null cases. 

In [11] Figures 2-3], we presented numerical calculations of all these functionals and their 
separation behavior in two cases. 
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• p= 10,000, and e= .01. 

• p = 10 6 and e = .0001. 

Although our calculations are exact numerical finite-p calculations, we remark that they corre- 
spond to sparsity exponents f3 — 1/2 and (3 — 2/3, respectively. The figures show the following. 

• There is a very close numerical approximation of the HCT to the ideal threshold, not just 
at large r but also even at quite small 2 < r < 3. 

• FDR and Bonferroni thresholds behave very differently from the ideal and from HC. 

• The separation behavior of the HCT is nearly ideal. For the constant FDR rules, the 
separation behavior is close to ideal at some r but becomes noticeably sub-ideal at other 
r. 

• The False discovery rate behavior of HCT and Ideal thresholding depends on r. At small 
t, both rules tolerate a high FDR while at large r, both rules obtain a small FDR. 

• The Missed detection rate of HCT and Ideal thresholding also depends on r. At small t, 
the missed detection rate is high, but noticeably less than 100%. At large r, the missed 
detection rate falls, but remains noticeably above 0%. In contrast the MDR for FDR 
procedures is essentially 100% for small r and falls below that of HCT/ideal for large r. 

These numerical examples illustrate the idealized behavior of different procedures. We can 
think of the HCT functional as the threshold which is being estimated by the actual HCT rule. 
On an actual dataset sampled from the underlying F, the HC threshold will behave differently, 
primarly due to stochastic fluctuations F n p « F. Nevertheless, the close approximation of the 
HCT threshold to the ideal one is striking and, to us, compelling. 



3 Behavior of ideal threshold, asymptotic RW model 



We now study the ideal threshold in the asymptotic RW model of Definition [L4J That is, we fix 
parameters r,/3 in that model and study the choice of threshold t maximizing class separation. 
We first make precise a structural fact about the ideal threshold, first observed informally in 

Definition 3.1 ROC Curve. The feature detection receiver operating characteristic curve 
(ROC) is the curve parameterized by (F P R(t) , T P R(t)) . The tangent to this curve at t is 

, . TPR'it) 

tan(t) = H 

V ; FPR'(t) 

and the secant is 

, . TPR(t) 
v ; FPR(t) 

Note that in the RW(e, r) model, TPR, FPR, tan and sec all depend on t, e,r and p, although 
we may, as here, indicate only dependence on t. 

Theorem 1 Tangent- Secant Rule. In the ARW(r, (3) model, we have 
tan(T Ideal ) 1 ( ^ t sec(T Ideal ) \ _ 



1 + tan(T Ideal ) 2 \ 1 + sec(T Ideal ) 



Here e — p 13 , t — yj2r log(p) and n ~ clog(p) 7 , p — > oo as in Definition 1.4 



Definition 3.2 Success Region. The region of asymptotically successful ideal threshold fea- 
ture selection in the (/3, r) plane is the interior of the subset where the ideal threshold choice 
T idea i(e,T) obeys 

Err(T idea i(e,T);e,T,p) — > oo, p — > oo; 



here we are in the ARW(r, (3) model of Definition 1.4 
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The interesting range involves (0, r) £ [0, l] 2 . The following function is important for our 
analysis, and has previously appeared in important roles in other (seemingly unrelated) problems; 
see Section [731 

( 0, < < 1/2, 

p*(f3) = I 0-1/2, l/2</3<3/4, (3.2) 

( (l-^0) 2 , 3/4</3<l. 

As it turns out, it marks the boundary between success and failure for threshold feature selection. 

Theorem 2 Existence of Phases. The success region is precisely r > p*(f3), < < 1. In 
the interior of the complementary region r < p*(0), 1/2 < < 1, even the ideal threshold cannot 
send the proxy separation to infinity with increasing (n,p). 

Definition 3.3 Regions I, II, III. The Success Region can be split into three regions, referred 
to here and below as Regions I-III. The interiors of the regions are as follows: 

I. 0- 1/2 <r < 0/3 and 1/2 < < 3/4; r > p*([3). 

II. 0/3 < r < 13 and 1/2 < < 1; r > p*{0). 

III. < r < 1 and 1/2 < < 1; r > p*{0). 
See Figure^ 




Figure 2: Phase diagram. The curve r = p*(0) splits the phase space into the failure region and 
the success region, and the latter further splits into three different regions I, II, III. Numbers in 
the brackets show limits of FDR and Lfdr at the ideal HCT as in Theorem 4. 

In the asymptotic RW model, the optimal threshold must behave asymptotically like ^2qlog(p) 
for a certain q — q(r, 0). Surprisingly we need not have q = r. 
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Theorem 3 Formula for Ideal Threshold. Under the Asymptotic RW model ARW(r, ($), 
with r > p*{/3), the ideal threshold has the form Tid ea i(e, t) ~ y/2q* log(p) where 




Region L 

J (3.3) 
Region II, III. 

Note in particular that in Regions I and II, q* > r, and hence T idea i(t,T) > r. Although the 
features truly have strength r, the threshold is best set higher than r. 

We now turn to FDR properties. The Tangent-Secant rule implies immediately 

Lfdr(T Idea i) 

• 1, p — > oo. (3.4) 



1, 


Region 


I 


/3-r 
2r ' 


Region 


II 


o, 


Region 


III 


obeys 






1, 


Region 


I 


r+f) 
4r ' 


Region 


II 


1/2, 


Region 


III 



(1 + FDR{T Ideal ))/2 
Hence any result about FDR is tied to one about local FDR, and vice versa. 

Theorem 4 Under the Asymptotic RW model ARW(r,p3), at the ideal threshold Ti dea i(t, t) 
proxy FDR obeys 

FDR(T ldeal ,e, T ) -> { Region II {note 0/3 < r < (3) (3.5) 



as p 



Lfdr(T ideal ,e,T)-y { ' . lb .,;<>„ (I ( uol, I > >! r I) (.'!.()! 

as p — > oo . 

Several aspects of the above solution are of interest. 

• Threshold Elevation. The threshold \/2q* log(p) is significantly higher than y/2r log(p) in 
Regions I and II. Instead of looking for features at the amplitude they can be expected to 
have, we look for them at much higher amplitudes. 

• Fractional Harvesting. Outside of Region III, we arc selecting only a small fraction of the 
truly useful features. 

• False Discovery Rate. Outside Region III, we actually have a very large false discovery 
rate, which is very close to 1 in Region I. Surprisingly even though most of the selected 
features are useless, we still correctly classify! 

• Training versus Test performance. The quantity yjq can be interpreted as a ratio: y/q* /r = 
min{^£, 2} = strength of useful features in training / strength of those features in test. 
From ( |3.3[ ) we learn that, in Region I, the selected useful features perform about half as 
well in training as we might expect from their performance in test. 



4 Behavior of ideal clipping threshold 

We now sketch some of the arguments involved in the full proof of the theorems stated above. 
In the RW model, it makes particular sense to use the clipping threshold function % hp , since all 
nonzeros are known to have the same amplitude. The ideal clipping threshold is also very easy to 
analyze heuristically. But it turns out that all the statements in Theorems 1-3 are equally valid 
for all three types of threshold functions, so we prefer to explain the derivations using clipping. 
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4.1 Sep in terms of true and false discoveries 

In the RW model, we can express the components of the proxy separation very simply when 
using the clipping threshold: 

A cUp (t,e,T) = e ■ t ■ Esgn(r + W)l {lT+wl>t} . 

Bcii P (t, e, t) = e ■ E1^ T+W \ >t } + (1 — e) ■ El^ w \>t} 

where W denotes an N(0, 1) random variable. Recall the definitions of useful selections TP and 
useless selections FP; we also must count Inverted Detections, for the case where the //,, > but 
rjf < 0. Put 

E(ID)(t;e,T,p) = e- p- $(-t- t), 

with again $ the standard normal distribution, and define the inverted detecton rate by IDR = 
p-^E{ID). Then 

A cUp {t; e, t) = t ■ (TPR(t) - 2IDR)(t)) . 

B cUp (t; e, t) = TPR(t) + FPR(t). 

We arrive at an identity for Sep in the case of clipping: 

~ (2^pr) ■ (TPR(t) - 2IDR(t)) 

bep[t;e,T) = . . 

y/TPR(t) + FPR(t) 

We now explain Theorem 1, the Tangent-Secant rule. Consider the alternate proxy 

7=—, , (2t) • TPRft) 2A(t) 

Septt) = , y ' y = , , say; 

W y/TPR(t) + FPR(t) SV2( t )' 

i.e. drop the term IDR. It turns out that for the alternate proxy, the tangent secant rule and 
resulting FDR-Lfdr balance equation are exact identitcs. 

Lemma 4.1 Let e > and r > 0. The threshold t a n maximizing Sep(t) as a function of t 
satisfies the Tangent- Secant rule as an exact identity; at this threshold we have 

Lfdr(t alt ) = i (1 + FDR(t alt )) . (4.1) 

Proof. Now, A and B are both smooth functions of t, so at the t optimizing AB -1 / 2 we 
have 

1/2 /,, I A ,\ B' (A! I A 

= B- 1 ' 2 A' B = —r-r • 

2B ) B 1 / 2 \B' 2B 



By inspection B'(t) < for every t > 0. Hence, 

A' _ 1 A 
Vb> ~ 2t~B' 

The Tangent-Secant Rule follows. We now remark that 

TPR(t) A 



(4.2) 



FDR(t) = 1 



TPR(t) + FPR{t) t B' 
and 

Lfdr(t) = l TPRm_ ^_ 

J y ' TPR'(t) + FPR'(t) t ■ B< 



Display (4.1) follows □ 
The fulljDroof of Theorem 1, which we omit, simply shows that the discrepancy caused 
by Sep ^ Sep has an asymptotically negligible effect; the two objectives have very similar 
maximizers. 
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4.2 Analysis in the asymptotic RW model 

We now invoke the ARW(r,(3) model: e = p^ 13 , r = ^/2r log(p), (n,p) — > oo, n ~ clog(p) 7 , 
p — * oo. Let tp(q) = y / 2qlog(p). The classical Mills' ratio can be written in terms of the Normal 
survival function as: 

$(tp(g)) ~P~ q /t p (q), p->oo. 
Correspondingly, the half Normal obeys: 

%{t p {q))~2-p-yt p (q), p-*oc. (4.3) 

We also need a notation for poly-log terms. 

Definition 4.1 Any occurrence of the symbol PL(p) denotes a term which is 0(log(p)^) and 
r2(log(p) _ ^) as p — > oo for some ( > 0. Different occurrences of this symbol may stand for 
different such terms. 

In particular, we may well have T\{p) = PL(p), Ta(p) = PL(p), as p — + oo, and yet ^l^p) 1 as 

p — ► oo . However, certainly ^[^y = PL(p), p — ■+ oo. 

The following Lemma exposes the main phenomena driving Theorems 1-4. It follows by 
simple algebra, and several uses of Mills' Ratio (4.3 1 in the convenient form ^>o(t p (q)) — PL(p) ■ 
p-P. 

Lemma 4.2 In the asymptotic RW model ARW(r, 0), we have: 

1. Quasi power-law for useful feature discoveries: 

E(TP)(t q {p),e,r) = PL{p)-p s ^\ p^oo. (4.4) 

where the useful feature discovery exponent 8 obeys 

xf a „s _ / 1 - P, 0<q<r, 
6{q;P,r) = j 1 _ /3 _ ( ^_^ )2j r < g < 1. 

2. Quasi power-law for useless feature discoveries: 

E(FP)(t q (p),e,T) = PL(p)-p 1 -«, p^oo, (4.5) 

3. Negligibility of inverted detections: 

E{ID){t q {p),c 1 T) = o{E{TP){t q {p),e,T)) 1 p - oo. (4.6) 

As an immediate corollary, under ARW(r, f3), we have: 

Sep(t q (p),e, t) ■ ( — ) = PL(p) • , / „ . =, P^oo. (4.7) 

On the right side of this display, the poly-log term is relatively unimportant. The driving effect 
is the power-law behavior of the fraction. The following Lemma contains the core idea behind 
the appearance of p* in Theorems 1 and 2, and the distinction between Regions I and Region 
II,III. 

Lemma 4.3 Let (f3,r) 6 (i, l) 2 . Let 7(5; r,/3) denote the rate at which 



y/pS(q;P,r) _|_ pl-q 



tends to 00 as p — ► 00, for fixed q, r, and (3. Then 7 > if and only if r > p*(/3). A choice of q 
maximizing this ratio is given by (3.3). 
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Proof Sketch. By inspection of S, it is enough to consider q < 1. The ratio grows to infinity 
with p like p 7 , where 

7 (<Z; r, /3) = %; /3, r) - max((l - g), %; /?, r))/2; 

provided there exists q G [0, 1] obeying 7(5; r, /?) > 0. 

Let q*(r,(3) denote the value of q maximizing the rate of separation: 

q*(r,(3) = argmax ?e[0)1] 7(g;r,/3); 

in the event of ties, we take the smallest value of q. Let's define p*(/3) without recourse to the 



earlier formula (3.2 1 but instead by the functional role claimed for it by this lemma: 

P*{I3) = inf{r : 7(3* (r, /?); r, /?) > 0,r > 0}. 
We will derive the earlier formula ( |3.2[ ) from this. Now 7 = 111111(7!, I2) where 
71(g) = % /3, r) - (1 - g)/2; and 72 (g) = %; /?, r)/2. 

We have 

j(q*(r,/3);r,/3) = max min 7,(9; r/3). 

qS[0,l] 1=1,2 

In dealing with this maximin, two special choices of q will recur below. 



(4.8) 



(4.9) 



• q\\ Viewed as a function of q, 71 is maximized on the interval [r, 1] (use calculus!) at 
5i(r, 0) = 4r, and is monotone on either side of the maximum. 

• q2'- On the other hand, 72 is monotone decreasing as a function of q on [r, 1]. Hence the 
maximizing value of q in (4.9 1 over the set of q-values where 72 achieves the minimum will 



occur at the minimal value of q achieving the minimum, i.e. at the solution to: 

(l-q) = 6(q;0r). (4.10) 



(4.10) is satisfied uniquely on [r, 1] by q%{r, 0) = (/3 + r) 2 /4r. 



The behavior of min(7i, 72) varies by cases; see Table 4.2 To see how the table was derived, 
note that 

71 < 72 iff 5 < 1 — q. 



Consider the first and second rows. For q > r, 8 = 1 — q — (3 — r + 2^/rq. Hence S < 1 — q on 
[r, 1] iff —0 — r + 2^/rq < iff q < q2- Consider the third and fourth rows. For q < r, 5 — - 1 — (3. 
Hence 5 < 1 — q on [0, r] iff > q. 



Range 


Minimizing 7^ 


r < q < q2 


7i 


max(r, q 2 ) < q 


72 


(3 < q < r 


72 


q < min(r, (3) 


7i 



Table 1: Minimizing 7$ in (4.9) 



Derivatives ji 



dq 



1,2, are laid out in Table 



4.2 



Table 4.2 presents results of formally combining the two previous tables. There are four 



different cases, depending on the ordering of qi,q2,/3 and r. In only one case does the above 



information leave q* undefined. (We note that this is a purely formal calculation; Lemma 4.4 
Display (4.131 below shows that rows 2 and 4 never occur.) 



To see how Table 4.2 is derived, consider the first row. Using the derivative table above, 
we see that min(7i,72) is increasing on [0,/3], constant on \j3,r] and decreasing on [r, 1]. Hence 
the maximin value is achieved at any q € [f3,r]. For row 2, min(7i,72) is increasing on [0,/3], 
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q < r 


q > r 


7i 

72 


1 

2 







Table 2: Derivatives of ji 



Case 


Minimizing 7^ 


Maximin q 


Maximin value 


< r, q 2 < r 


< 


7i (0,/?) 
. 72 (ftl) 


q* £ [0, r] 


li(0), 72(92) 






' 7i (0,f3) 






< r, q 2 > r 




72 (Ar) 
7i (r,q2) 

, 72 (92,1) 


q* = min(qi,q 2 ) 


7i(9*) 


> r, q 2 > r 


1 


' 72 (0,r) 
7i (r,q 2 ) 
< 72 (?2,1) 


q* = min(gi,q 2 ) 


7i(9*) 


(3 > r, q 2 < r 


1 


7i (0,r) 
. 72 (r,l) 


<f = r 


7i W 



Table 3: Maximin Behavior in (4.9 1 



constant on on \J3,r] increasing on [r, min(gi, 9 2 )] and decreasing on [min(gi, 52), 1]- For row 
3, min(7 1 ,7 2 ) is constant on [Q,r], increasing on [r, min(<7i, g 2 )] and monotone decreasing on 
[min(<7i, q 2 ), 1]. For row 4, min(7i,7 2 ) is increasing on [0,r] and decreasing on [r, 1]. 

We are trying to find circumstances where 7 < 0. In the above table, we remarked that 
the hypotheses of rows 2 and 4 can never occur. We can see that in row 1, ji(0;r, (3) > for 
(3 £ [0,1], r > (3. This leaves only row 3 where we might have 7 < 0; in that case either q* = q± 
or q* — q 2 . Writing out explicitly 



7i(«;r,/3) = l-/?+(V?- Vr)\-{l-q)/2. 



• Case q* 



qi- 



~ /1 (q 1 ;r,0) = l/2-0 + r. 
Hence 71(91; r, (3) — along r = — 1/2, and 71(91; r, /3) < for r < [3 — 1/2. 

Consider 1/2 < (3 < 3/4. In this range, Lemma 4.4 shows r < qi((3 — 1/2, j3) < q 2 (0 - 
1/2, 0) < 1. Hence 9*(r,/3) = 9i(r,/3) for r < 0-1/2, G (1/2,3/4). We conclude that 



p*(/?)=/?-l/2, 1/2 < < 3/4. 



(4.11) 



• Case q* — q 2 . 

7i( g2 ;r,/3) = l/2-(/3 + r) 2 /8r. 

We have 71(92; r,0) = along r = (1 — ^1 -/3) 2 , and 71(92; r,/3) < for < r < 
(1 - x/W) 2 . 

Consider 3/4 < < 1. In this range, Lemma 4.4 shows shows r < q 2 ((l — \/l — /3) 2 , /3) < 
9i ((1 - % /T^) 2 ,/3) < 1. We conclude that 



3*08) = (1- yi^) 2 , 3/4 < /3 < 1/2. 



(4.12) 



Together (4.8|, (4.11), and (4.12) establish that the formula (3.2| has the properties implied by 
the Lemma. 



To complete the Lemma, we need to validate formula (3.3 1. This follows from rows 1 and 3 
of Table O and Lemma WM □ 



16 



Lemma 4.4 Let q\(r, /3) 
< (3 < 1 we have 



4r and q 2 {r,j3) — {(3 + r) 2 /4r just as in the previous lemma. For 



qi < <?2 

<?2 < 1 

r < <72 



iff 
iff 
iff 



r < (3/3 
r>(l-y/T 

r < (3 



Proof. School algebra. 
Lemmas 4.3 and 



4.4 



(4.13) 
(4.14) 

□ 

show that (3.3 1 gives us one choice of threshold maximizing the rate at 
which Sep tends to infinity; is it the only choice! Except in the case, r > [3 and q 2 < r, this is 
indeed the only choice. As Table 4.2 shows, in case r > (3 and q 2 < r, any q £ [13, r] optimizes 
the of separation. It turns out that in that case, our formula q* not only maximizes the rate of 
separation, it correctly describes the leading-order asymptotic behavior of Tideai- The key point 
is the Tangent-Secant Formula, which picks out from among all q € [f3,r] uniquely q 2 . This is 
shown by the next two lemmas, which thereby complete the proof of Theorem 3. 



Lemma 4.5 Set 70 (g; r, (3) — —(3 — r - 
the threshold t q {p). Suppose that q > r 

TPR{t q {p)) 
FPR(t q (p)) ~ J 

Then 

TPR{t q {p)) 
FPR(tJp)) ' 



2^/rq, for q G (0, 1). In the ARW{r, (3) model, consider 
Then 



7o(g,' - ,/3) 



IJq-IJr 



Suppose that q < r. 



Suppose that q =/= r. The 



t g (p) 



00. 



TPR'(t q (p)) 
FPR'(t q (p)) 2 



■P 



rto(q,r,0) 



(4.15) 



(4.16) 



(4.17) 



Proof. Simple manipulations with Mills' ratio, this time not simply grouping poly log terms 
together with the symbol PL, give that for the threshold under the ARW model, if q > 0, 



If q > r 



FPR{t q {p)) 



TPR(t q (p)) 



7T 

2?r 



V 



V 2 9iog(p) 



p 



00. 



p 



-/Mx/^-v^) 2 



(^q-^)V^Sip)' 



Display (4.151 follows. If < q < r 

TPR{t q {p))) 



P 



00. 



Display (4.161 follows. If q 7^ r we have the exact identities: 



TPR\t q (p)) = 



P 



FPR'{t q {p)) = 



■P 



Display (4.171 follows. 



□ 



Lemma 4.6 q 2 (r,P) is the unique solution of jo(q;r, (3) ■ 

Tldeal/tq 2 (j>) — > 1, P 



0. Suppose r > (3 and (3 < q 2 < r. 

(4.18) 



oo. 



Proof. By inspection for q < q 2 , 70 < while for q > q 2 , 70 > 0. So from (4.171, Lfdr{t q {p)) 
tends to or 1 depending on q < q 2 or q > q 2 . Fix n > 0. If T/deai < tqo-r) infinitely often as 
p — > 00, then would be a cluster point of Lfdr(Tid e ai)- Similarly, if Tideai > tq 2 +ri infinitely 
often as p — > 00, then 1 would be a cluster point of LfdriTjdeai)- From q > (3 and (4.161 we 
know that FDR(Tid ea i) — ► 0. ^From the Tangent-Secant rule we know that Lfdr(Tideai) — > 1/2. 
(|4~l8| follows. □ 
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4.3 FDR/Lfdr properties of ideal threshold 

We now turn to Theorem 4. Important observation: 

The asymptotic Tideai — t q *ip) ■ (1 + "(1)) is simply too crude to determine the 
FDR and Lfdr properties ofTj^eaij it is necessary to consider the second-order effects 
implicit in the (1 + o(l)) term. For this, the Tangent- Secant formula is essential. 



Indeed (4.171 shows that the only possibilities for limiting local FDR of a threshold of the 
exact form t q {p) are 0, 1/2, 1. The actual local FDR of Tjdeai spans a continuum from [0, 1], due 
to the fact that small perturbations of t q (p)(l + o(l)) implicit in the o(l) can cause a change in 
the local FDR. To understand this, for q ^ r and s G (0, oo), put 

q(q, r,s,p)= [r 1 ' 2 ± yV /2 - r 1 / 2 ) 2 +log( s )/log(p) 

where the sign of ± is + if q > r. Clearly, q is well-defined for all sufficiently large p. As an 
example, q{q,r, l,p) — q. The peculiar definition ensures that 

Wqip) - trip)) = </>{tq(p) - trijp)) ' «• 

By simple algebra one can show 
Lemma 4.7 Let q,r, and s be fixed. With q = q(q,r, s,p), 

kip) , 

tqip) 

and 

4>ihip)) , 
nkip)) 

Let's put for short F s = F P Rit q ( q _ riS ^ p )ip)) and T s = TPR(tq( q ^ StP )(p)). Then, the last few 
displays show 

T' s = s-T[, F' S ~F{, p^oo. 
T s ~ s -Ti, F S ~F X , p^oo. 

Hence 

F F 
FDRit q{q ^^ p) ip)) = Ts ^ Fs ~ - ; Ti ~ Fi , P ^ oo; 

p' p' 
Lfdrit q(q ^ p) ip)) = — + s f , - - x + - , p -> oo. 

Choosing s appropriately, we can therefore obtain a perturbed q which perturbs the Lfdr and 
FDR. In fact there is a unique choice of s needed to ensure the Tangent-Secant formula. 

Lemma 4.8 For given values F\, F[, Ti ^ 0, T[ ^ 0, put 

T{ T x 

This choice of s obeys the Tangent- Secant rule: 

F{ _lf v , ft 



T{ + F{ 2 V s* ■ Ti + Fx 



To use this recall (4.15 )-(4.16 )-(4.17l. These formulas give expressions for Ti/Ti and T[jF[. 
Plugging in q — q* we get 

Corollary 4.1 We have 

Tldeal ^ ^g(g*,r,s*,p)(p)' 

where s* is obtained by setting T\ — i\/q* — ^/r)^ 1 , ft = 2/^/q, T{ = 1 and F[ = 2. Moreover, 
if/3 '• B, 

FDRiT Ideal ) ~ LfdriT Ideal ) ~ 

2r 4r 
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5 Connection of HC objective with Sep 



Let F = F e T be the two-point mixture of Definition 2.3 and G — G e T the corresponding folded 
distribution. Then in the asymptotic RW model we have, for t — t q (p): 



HC(t; F e<T ) 



G(t) - W) 
y/G(t)G(t) 



e(* T -* )(*) 



[((l-c)* + e*r)((l-c)*o- 

e(tt T -tt )(f) 
[((l- c )*o + e* r )]Va 
TPR(t) - j^FPR(t) 
[FPR(t) + TPRfa)] 1 / 2 

TPR{t) 
[TPR{t)+FPR(t)] 1 / 2 
Sep(t; e, r), p — > oo. 



e^r)] 1/2 



(5.1) 

(5.2) 
(5.3) 



Step (5.1l follows from G^ T (t q (p)) — > 1 as p —* oo. Step (5.2) follows from eFPR(t q (p)) = 
o(TPRjt q (p))) as p -> oo. Step Q follows from TPR(t q \p))~= o(TPR(t q (p))) as p oo. 

This derivation can be made rigorously correct when g = q*(r,(3), where g* is as announced 
in Theorem 2. With extra work, not shown here, one obtains: 

Theorem 5 The statements made for the ideal threshold in Theorems 1-3 are equally valid for 
the ideal HC threshold. 



6 Suboptimality of phase diagram for other methods 

Setting thresholding by control of the False Discovery rate is a popular approach. Another ap- 
proach, more conservative classical and probably even more popular, is the Bonferroni approach, 
which controls the expected total number of false features selected. Theorem 2 and 3 implicitly 
show the suboptimality of these approaches. The implications include an attractive relationship 
to the regions of Definition |3.3[ 



Theorem 6 Under the Asymptotic RW model ARW (r, /?) , the regions of the (r,(3) phase space 
for successful classification are: 

T > (1- y/1 -P) 2 , 0<)9<1. 

Remark. The successful region of Ideal FDRT and Bonferroni is smaller than that of HCT. 
However, even for regions when FDRT or Bonferroni would be successful, HCT stil yields more 
accurate classifications in terms of the convergence rate. This is especially important for finite-p 
performance. In short, Bonferroni and FDRT fail to adapt the difficulty level of the classification 
problem measured by (e, r), one picks a fixed threshold, another picks a fixed false discovery rate. 
Proof Sketch. We continue to use the notations q*(r,j3), 3i(r,/?),i = 1,2, j(r, (3) ji(r,f3), 



i = 1,2 from the proof of Lemma 4.2 The Bonferroni threshold — <f> 1 (l/p) = ti(p)(l + o(l)) as 



p — » oo. In this proof sketch, we analyze t\(jp) as if it were the Bonferroni threshold. Applying 



(4.7 1, and the definition of 7 we have 

S^pMp), e, r)-M= PL{p) ■ p'^\ p - 00. 



n 

Now 



7(1; r, /?) = min 7j(l;r,/3); 

1—1,2 
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while 7i(l;r,/3) = S(l;r,/3)/2 and 7 2 (l;r,/3) = <5(l;r,/3). Hence 7(l;r,/3) > iff 5(l;r,/3) > 0. 
But 

5(1; r, (3) = -p-r + 2^r, 

which is positive iff r > 1 — \/\ — (3. 

Precise FDR control at level a requires that TP/FP = (1 — a) /a. Suppose that r < j3. 
Equation (4.15) relates the TP/FP ratio to 70, q and r. Clearly, the only way to keep TP/FP 
bounded away from and infinity is to choose q so that 70(9; r,/3) — 0. We note that q^{r,P) 
exactly solves this problem: 

7o{q 2 (r,/3);r,/3) = 0. 

It follows that in the ARW(r,(3) model, the FDRT functional obeys Tpdr^^t) — t q2 {p){\ + 
o(l)), p — > 00. In this proof sketch we analyze t q2 (p) as if it were exactly the FDR threshold. 

7(92(^,/3); r,0) = ir>m')i{q 3 {r,0)\r,0)', 

1—1,2 



Hence j(q2',r,0) > is positive iff 



while ji( g 2 ;r, (3 ) = 1 - (/3 + r r) and 7 2 (g 2 ; r,fi) = 
r > 1 - v / T rr ?. 

The last paragraph assumed r < p. On inuitive grounds, the region r > (3 offers even better 
perfomance, so it must lie entirely above the phase transition. We omit details. □ 

Table [4] compare the exponents in SEP for different methods. Also see Figure [3] for a com- 
parison of the exponents for (3 = 1/2 and (3 — 5/8. 



Method 


SEP Exponent 


Boundary, 3/4 < (3 


Boundary, 1/2 < (3 < 3/4 


Ideal, HCT 
FDRT 
Bonferroni 


7 

1 {P+rf 

2 8r 

-j3-r + 2 s /r 


l-y/l-13 

1 - vi - P 


0-1/2 

1 - Vi - a 



Table 4: Comparison of the exponents in SEP. 



7 Discussion and conclusions 

7.1 Beyond ideal performance 

We consider here only asymptotic, ideal behavior. Conceptually, the ideal threshold envisions 
a situation with an oracle, who, knowing e and r and n and p, chooses the very best threshold 
possible under those given parameters. In this paper we have analyzed the behavior of this 
threshold within a certain asymptotic framework. However, clearly, no empirical procedure can 
duplicate the performance of the ideal threshold. 

We have seen that ideal HC thresholding comes close. This ideal HC threshold does not 
involve optimal exploitation of knowledge of e and r, but merely the availability of the underlying 
distribution of feature Z-scores F e/T . In this paper, we analyzed the behavior oft HC — Thc(F<l,t)- 

This is an ideal procedure because we never would know F e y, instead we would have the 
empirical CDF F n , p defined by 

1 P 

F n , p {z) = - ^2^{zu)<z}- 
P 3=1 



The (non-ideal) HCT that we defined in Section 1.5.2 is then simply 



Because F e ^ T (z) = E(F n ^ p )(z), we are conditioned to expect that t HC rj t^ p ; indeed there 
are generations of experience for other functionals T showing that we typically have T(F n ^ p ) sa 
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Figure 3: Comparison of SEP exponents for HCT, FDRT, and Bonferroni. Top: (3 = 1/2. 
Bottom: (3 — 5/8. x-axes displays r-values from to (3, y-axes display corresponding exponents. 
Solid blue horizontal bar indicates that r falls in the failure region where all exponents are 0. 



T(E(F ntP )) for large n,p for those functionals. Proving this for T — Thc is more challenging than 
one might anticipate; the problem is that Thc is not continuous at F — <I>, and yet F e rp) >T ( p ) ~~ * ^ 
as p — + oo. After considerable effort, we justify the approximation of HCT by ideal HCT in [12] . 
Hence the analysis presented here only partially proves that HCT gives near-optimal threshold 
feature selection; it explains the connection at the level of ideal quantities but not at the level 
of fluctuations in random samples. 

7.2 Other asymptotic settings 

In the analysis here we consider only the case that n ~ clog(p) 7 . The phase diagram will be 
slightly different in case n is bounded and does not go to infinity with p, and will be again 
slightly different in case n ~ cp 1 . The full details are presented in [12] . 

7.3 Phase diagram for finite sample sizes 

While the main focus of our paper has been asymptotic analysis, we mention that the phase 
diagram also reflects the finite-sample behavior. In Figure [4] we consider p = 3 x 10 3 x AT, 
N = (1, 10, 100). For such p, we take n — log(p)/2 and display the boundary of the set of (f3, r) 
where ideal HCT yields a classification error between 10% and 40%. The figure illustrates that 
as p grows, both the upper bound and the lower bound migrate towards the common limit curve 
r = p*{/3). 

7.4 Other work on HC 

HC was originally proposed for use in a detection problem which has nothing to do with threshold 
feature selection: testing an intersection null hypothesis = Vj §\. The literature has 
developed since then. Papers [2U] extends the optimality of Higher Criticism in detection 
to correlated setting. Wellner and his collaborators investigated thc Higher Criticism in the 
context of Goodness-of-fit , see for example [24]. Hall and his collaborators have investigated 



21 




p 

Figure 4: Classification boundaries for p = 3 x 10 3 x (1, 10, 100). For each p, both an upper 
boundary and a lower boundary are calculated which correspond to the proxy classification errors 
of 10% and 40% using ideal HCT. The gap between two boundaries gets smaller as p increases. 
The solid black curve displays the asymptotic boundary r = p* (/?) . 

Higher Criticism for robustness, see for example [5]. HC has been applied to data analysis in 
astronomy ([7, 26]) and computational biology [15] . 

HC has been used as a principle to synthesize new procedures: Cai, Jin, Low in [5] (see also 
Mcinshauscn and Rice |28J use Higher Criticism to motivate estimators for the proportion e of 
non-null effects. 

7.5 HC in classification 

Higher Criticism was previously applied to high-dimensional classification in Hall, Pittclkow, 
Ghosh (2008) [2T], but there is a key conceptual difference from the present paper. Our approach 
uses HC in classifier design - it selects features in designing a linear classifier; but the actual 
classification decisions are made by the linear classifier when presented with specific test feature 
vectors. The approach in [21] uses HC to directly make decisions from feature vector data. Let's 
call these two different strategies HC-based feature selection (used here) and HC-based decision 
(used in [2"T].) 

In the ARW model, for a given classifier performance level, the HC-based decision strategy 
requires much stronger feature contrasts to reach that level than the HC-based feature selection 
strategy. In this paper, we have shown that HC-feature-selection requires useful features to have 
contrasts exceeding 

VV(/?)logp/VMl + °(l)). (n,p)-> oo. 
HC-decision requires useful features to have contrasts exceeding 

VV(/3)logp- (l + o(l)), p^oo. 
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Therefore, if the number of training samples n > 1, HC-feature selection has an advantage. 
Imagine that we have n — 36 training samples; we will find that HC feature selection can 
asymptotically detect features roughly 1/6 the strength of using HC directly for decision. 
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